Sentryhawk - Security Data Platform

Sentryhawk is a data engineering project that aggregates public vulnerability data from NVD, cvelistV5, KEV and EPSS into a unified analytical platform. It normalizes and deduplicates records across these feeds, computes a weighted Exposure Index from CVSS scores and serves the results through interactive dashboards. The project was built to explore how scattered, heterogeneous vulnerability feeds can be structured into a consistent, queryable dataset and how derived metrics like cumulative severity scores can surface risk concentration across vendors and products that raw CVE counts alone would obscure.

Features

Interactive Dashboards: Preconfigured Superset dashboards show global vulnerability trends, vendor-specific risk rankings and product-level details. Users can filter by year or vendor and see visualizations of new CVEs and exposure scores. Dashboards are publicly sharable and support alerts/reports. The BI stack is fully open-source, providing sub-second query latency on aggregated views.
Exposure Index: A composite score aggregating CVSS severity across all known CVEs for a vendor/product. The Exposure Index highlights which vendors or products have an unusually high concentration of vulnerabilities. Teams can use it to prioritize patching or inventory audits where the index is high.
Normalization & Deduplication: The data pipeline automatically cleans and merges overlapping feeds. Inconsistent vendor/product naming conventions are harmonized so that the same entity isn't counted twice. This avoids false inflation of vulnerability counts and ensures the Exposure Index is meaningful.
Real-time KEV & EPSS Streaming Pipeline: The platform includes a real-time streaming pipeline for immediate ingestion of enrichment feeds. Apache Kafka ingests CISA's Known Exploited Vulnerabilities and EPSS exploit prediction scores through dedicated producers that monitor feed changes. KEV uses incremental updates with hash-based deduplication, while EPSS follows a full-replace strategy. Both streams are ingested into Apache Druid via Kafka supervisors, enabling real-time dashboard updates in Superset. This setup ensures immediate alerts for newly exploited vulnerabilities and updated exploit probability scores within minutes of publication.
Open Source Stack: All components are based on OSS (Python, MongoDB, Druid, Redis, Superset, etc.), with infrastructure defined in code. The result is a cost-effective, extensible system that any organization can audit and extend. All services run in Docker containers, enabling easy upgrades and community contributions.

Exposure Index

The Exposure Index is a weighted severity score that quantifies the cumulative risk burden of a vendor or product based on its known CVEs. Rather than simply counting vulnerabilities, it scales each CVE's CVSS base score by its severity tier to ensure that a Critical vulnerability contributes proportionally more weight than a Medium one.

Scoring Formula

Each CVE is assigned a weighted score derived from its CVSS base score:

Severity	Weight	Example
Critical (9.0–10.0)	1.00×	9.8 (for CVSS 9.8)
High (7.0–8.9)	0.75×	6.0 (for CVSS 8.0)
Medium (4.0–6.9)	0.50×	2.75 (for CVSS 5.5)
Low (0.1–3.9)	0.25×	0.5 (for CVSS 2.0)

The Exposure Index for a given vendor, product or time window is the sum of all weighted scores across its CVEs.

Thresholds & Classification

Exposure Index thresholds are computed dynamically at pipeline runtime using the 25th, 50th and 75th percentiles of the score distribution across all entities in a given aggregation level (global, vendor, or product). This makes the classification adaptive. Thresholds automatically recalibrate as the CVE landscape evolves. An entity scoring above the 75th percentile is flagged as high exposure, while below the 25th percentile is considered low.

Network Exposure Index

A parallel metric, the Network Exposure Index applies the same weighted aggregation but filters exclusively to CVEs with a network-based attack vector (AV:N in the CVSS vector string). This isolates remotely exploitable risk, which is typically of higher operational concern to defenders.

Aggregation Levels

The index is computed across multiple time granularities (daily, monthly, yearly) and entity scopes (global, vendor, product), as well as trailing windows (last 30 days, last 12 months) to support both trend analysis and point-in-time risk snapshots.

Data Pipeline Architecture

The deployed pipeline is fully managed in AWS and consists of these stages:

Trigger & Orchestration (EventBridge + Step Functions): An Amazon EventBridge scheduled rule (cron) triggers the data pipeline at fixed intervals. This invokes an AWS Step Functions state machine, which orchestrates all downstream processing. Step Functions handles sequencing, retry logic and error handling for the entire workflow.
Change Data Capture Engine (CDC): At the start of each run, the pipeline uses AWS Systems Manager to start an EC2 backend server. This instance runs Docker containers for MongoDB and a CVE-search service. A script on the instance fetches the latest CVE data from the NVD API and the cvelistV5 GitHub repository. A MongoDB map-diff process then determines which vendor–product combinations have new, removed, or changed CVEs. These differences are sent as messages to an Amazon SQS Product Queue. If any tasks fail to process, they go to a dead-letter queue. Unprocessed items in the DLQ are logged and an SNS alert is sent to administrator.
Data Ingestion (SQS + Lambda): Messages in the SQS queue drive the ingestion of product JSON data. AWS Lambda functions take each vendor–product message and call the external Vulnerability Lookup API to retrieve detailed CVE JSON data for that vendor/product. The raw JSON results are stored in an S3 ingestion bucket.
ETL and Data Quality (AWS Glue): The raw JSON files are then processed by a sequence of AWS Glue jobs. The jobs run in order:
- Merge JSONs: Combine the raw files into NDJSONs.
- Staging Load: Process the merged data and ingest it into a staging Iceberg table in S3.
- Data Quality Check: Validate data (e.g. schema, null checks) on the staging table.
- Production Load: Merge staging into the production Iceberg table using Slowly Changing Dimension Type 2 logic, so that historical CVE records are preserved.
If any Glue job fails, the pipeline pauses and an SNS notification is sent to the admin.
Materialized Views (Amazon EMR): After the production data is updated, the Step Function launches an Amazon EMR Spark cluster. This cluster computes pre-aggregated materialized views (for example, aggregations of CVSS scores by vendor/product, exposure indices, etc.) using Spark. The resulting views in parquet are written to S3. The EMR cluster then terminates automatically to minimize cost.
Parallel Analytics Refresh (Redshift & Druid): The pipeline executes simultaneous ingestion to both analytical engines:
- Redshift Branch: Executes stored procedures for incremental upserts with SCD Type 2 logic, maintaining cve_current table in Redshift with full history tracking and watermark-based incremental processing.
- Druid Branch: Step Function scales up the Druid EC2 instance, executes a containerized ECS task to load materialized views from S3 and then scales the instance back down for cost optimization.
This dual approach provides complementary strengths: Redshift supports complex historical analysis, SCD tracking and enterprise-grade SQL operations, while Druid enables sub-second queries for interactive Superset dashboards.

Throughout the pipeline, Amazon SNS and CloudWatch provide monitoring and alerting. Any step failure triggers an SNS alert so administrators can investigate, ensuring reliability and transparency.

Setup & Deployment

The Sentryhawk repository includes all configuration and code needed for deployment. In summary, a typical setup involves:

Launch an EC2 Host: Start an EC2 instance (e.g. Amazon Linux 2, c5.xlarge or larger) and install Docker & Docker Compose:

sudo yum update -y
sudo yum install git -y
sudo amazon-linux-extras install docker -y
sudo systemctl start docker && sudo systemctl enable docker
sudo usermod -aG docker ec2-user
# Install Docker Compose (v2 plugin)
mkdir -p ~/.docker/cli-plugins
curl -SL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 \
  -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose

Also obtain an NVD API key from https://nvd.nist.gov/developers/request-an-api-key for the CIRCL CVE search component.

Configure Environment: Clone the CVE-Search pipeline repo:
```
git clone https://github.com/cve-search/CVE-Search-Docker.git
cd CVE-Search-Docker
```
Copy the provided configuration files into the root directory: docker-compose.yml, .config.yml, Dockerfile.api, cve_backfill.sh, cve_mongo_map_backup.sh, cve_sqs_log.py, etc. Populate your environment (e.g. in a .env file or via export) with the required secrets and endpoints. For example, set your NVD API key and other variables:
```
NVD_NIST_API_KEY="<YOUR_NVD_KEY>"
MONGO_URI="mongodb://mongo:27017/"
DB_NAME="cvedb"
CVELIST_COLL="cvelistv5"
NVD_COLL="cves"
MAP_NVD_COLL="map_vendor_product_nvd"
MAP_CVELIST_COLL="map_vendor_product_cvelistv5"
DELTA_QUEUE_URL="https://sqs.<REGION>.amazonaws.com/<ACCOUNT_ID>/<QUEUE_NAME>"
```
Build & Launch CVE Services: Run the Docker Compose services:
```
docker compose build --no-cache
docker compose up -d mongo redis cve_search
```
Wait a few minutes for MongoDB and Redis to start. Then execute the initial backfill to load all historical vulnerability feeds:
```
chmod +x cve_backfill.sh
nohup ./cve_backfill.sh > cve_backfill.log 2>&1 &
tail -f cve_backfill.log
```
This populates MongoDB with NVD, cvelistV5 and other CVE sources and builds the vendor–product mapping collections. Once finished, start the lookup API container:
```
docker compose up -d vendor_product_api
```
You can verify the API is running by querying http://<EC2-IP>:8000/api/search/{vendor}/{product}
Setup ETL Workflow on AWS: In the AWS console or via IaC, create: an S3 bucket for ingestion data, an SQS queue with DLQ for changed keys and an AWS Step Functions state machine using the provided JSON definition step_functions/sentryhawk_state_machine.json. Attach an IAM role granting access to S3, SQS, Glue, EMR, etc. Configure an EventBridge Scheduled rule to trigger the Step Function daily. The Step Function uses AWS Systems Manager Run Command to launch the Docker-based services (MongoDB + CVE-Search) and run the CDC job each cycle.
Data Transformation: The workflow will launch the AWS Glue jobs (also provided in glue/ scripts) to transform the NDJSON data into Parquet, run DQ checks and apply SCD Type 2 merges. After that, it fires up an EMR Spark cluster (using the emr/ scripts) to generate the final materialized view tables. Verify that the Glue and EMR steps complete successfully. SNS alerts will notify you on failure for any step.

Deploy Analytics Stack: Build the Druid ingestion image from the druid/ directory:

docker build -t cve-ingestion-druid:latest druid/cve-ingestion-druid

Tag and push this image to your ECR repository (sentryhawk_repo):

aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com
docker tag cve-ingestion-druid:latest <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/sentryhawk_repo:cve-ingestion-druid
docker push <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/sentryhawk_repo:cve-ingestion-druid

Adjust <REGION> and <ACCOUNT_ID> for your AWS account. Next, run this image as a task on ECS to ingest the materialized views into Apache Druid. For KEV/EPSS real-time ingestion, deploy the Kafka stack using scripts in kafka-druid-ingest/ directory.

Configure Superset: Launch a PostgreSQL RDS instance for Superset's metadata. In your Docker setup, place superset_config.py (customized for the RDS host and credentials) alongside the docker-compose.yml. In .env, set the DATABASE_ variables to point Superset at the RDS (for example, DATABASE_HOST, DATABASE_USER, etc.). Then run Superset and Redis via Docker:
```
docker compose up -d superset redis
```
Initialize the Superset DB superset db upgrade, create an admin user and import the included dashboards superset import-dashboards -p dashboards.
Domain and Security: Finally, configure DNS and CDN. Point your domain (e.g. example.com) to a CloudFront distribution in front of the EC2 instance running Superset. Use ACM to attach an SSL certificate. Enable WAF rules. The documentation provides a CloudFront Function that forces www and rewrites / to the Superset dashboard home. Redact any account-specific details like IPs or ARNs when you publish your config.

Note: The above steps outline a complete deployment. All sensitive values (API keys, passwords, ARNs) should be replaced with placeholders or stored in secrets managers.

Infrastructure & Data Stack

Cloud Infrastructure (AWS)

Compute & Containers: EC2, ECS, Lambda, EMR
Storage & Databases: S3, RDS, DynamoDB, Redshift
Data Processing & ETL: Glue, Step Functions, EventBridge
Messaging & Queuing: SQS, SNS
Networking & CDN: VPC, Route 53, CloudFront, WAF
Security & Identity: IAM, KMS, Secrets Manager, ACM
Monitoring & Management: CloudWatch, Systems Manager
Infrastructure as Code: CloudFormation
Container Registry: ECR

Data & Analytics Stack

Databases: MongoDB, Apache Druid
Data Lakehouse: Apache Iceberg
Data Warehouse: Redshift
Stream Processing: Apache Kafka
Caching: Redis
Visualization: Apache Superset
Data Formats: NDJSON, Parquet
Containerization: Docker

Data Sources

Vulnerability Feeds: NVD API, cvelistV5, EPSS, KEV
Lookup Services: Vulnerability-Lookup API, CIRCL CVE-Search

Infrastructure as Code & Development

IaC Tools: Terraform, CloudFormation
Infrastructure Discovery: Former2
Scripting & Automation: Python, Bash
Developer Tools: MobaXterm, Git Bash

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
assets		assets
athena		athena
cloudfront		cloudfront
cve-search		cve-search
druid		druid
emr		emr
glue		glue
kafka-druid-ingest		kafka-druid-ingest
lambda		lambda
redshift		redshift
scripts		scripts
step_functions		step_functions
superset		superset
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sentryhawk - Security Data Platform

Table of Contents

Features

Exposure Index

Data Pipeline Architecture

Setup & Deployment

Infrastructure & Data Stack

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sentryhawk - Security Data Platform

Table of Contents

Features

Exposure Index

Data Pipeline Architecture

Setup & Deployment

Infrastructure & Data Stack

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages